06 Jul 2025

GPU cluster monitoring

Machine learning

In this article I list the various metrics and alerts one should have when monitoring a GPU cluster to ensure efficient usage; a minimal collection sketch follows the list.

  • Allocated GPUs are used
    • Used to detect jobs that request multiple GPUs but end up using only one or a few of them
  • GPU utilization below threshold (e.g., 10%)
    • Used to detect workloads that do not make full use of the GPU or are allocated to an oversized GPU
  • GPU utilization above threshold (e.g., 90%)
    • Used to detect when the GPU is saturated
  • GPU utilization range
    • Used to detect uneven distribution of GPU compute workload
  • GPU memory utilization below threshold (e.g., 10%)
    • Used to detect workloads that do not make full use of the GPU or are allocated to an oversized GPU
  • GPU memory utilization above threshold (e.g., 95%)
    • Used to detect when a job is about to run out of GPU memory
  • If using InfiniBand
    • InfiniBand receive/transmit > 0 when running multi-node workloads
      • Used to identify workloads that are not properly configured to use InfiniBand
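
As a concrete starting point, here is a minimal collection sketch using pynvml (the NVIDIA management library). The thresholds mirror the example values above; in a real deployment these checks would feed an alerting system rather than print to the console, so treat the structure, not the specifics, as the point.

    # Minimal sketch: poll per-GPU utilization with pynvml (pip install nvidia-ml-py)
    # and flag the conditions listed above. Thresholds are the example values from the list.
    import pynvml

    LOW_UTIL, HIGH_UTIL, HIGH_MEM = 10, 90, 95  # percent, example thresholds

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            mem_pct = 100 * mem.used / mem.total

            if util < LOW_UTIL:
                print(f"GPU {i}: utilization {util}% below {LOW_UTIL}% (underused or oversized GPU)")
            if util > HIGH_UTIL:
                print(f"GPU {i}: utilization {util}% above {HIGH_UTIL}% (saturated)")
            if mem_pct > HIGH_MEM:
                print(f"GPU {i}: memory {mem_pct:.0f}% above {HIGH_MEM}% (about to run out of memory)")
    finally:
        pynvml.nvmlShutdown()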

What data do I need to do time series forecasting?

There are three values that you must know for each data point of your time series:

  • its entity, which represents a unique value identifying the time series (e.g., a product SKU). Without this information, it is not possible to construct a sequence of points since there's no logical grouping between the points.
  • its timestamp, which represents the moment in time the data point was recorded. Without this information, it is not possible to construct a sequence of points since there's no sequential ordering between the points.
  • its target, which represents the measurement of the data point itself that we want to predict. Without this information, we have effectively nothing on which to base our predictions.

Such information would look as follows when organized in a table:

Entity  Timestamp  Target
A       1          5
A       2          6
A       3          7
B       1          13
B       2          27
B       3          55

You may also have recorded other values at the same time, which can be a useful source of information when trying to predict the time series.

Entity  Timestamp  Target  Value 1
A       1          5       3
A       2          6       2
A       3          7       1
B       1          13      47
B       2          27      33
B       3          55      5
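
To make this layout concrete, here is a minimal pandas sketch of the same table in the long format most forecasting libraries expect; the column names are arbitrary choices, not a required convention.

    # Minimal sketch: the table above as a long-format pandas DataFrame.
    import pandas as pd

    df = pd.DataFrame({
        "entity":    ["A", "A", "A", "B", "B", "B"],
        "timestamp": [1, 2, 3, 1, 2, 3],
        "target":    [5, 6, 7, 13, 27, 55],
        "value_1":   [3, 2, 1, 47, 33, 5],   # optional extra covariate
    })

    # Each entity's series is recovered by grouping on the entity and sorting by timestamp.
    for entity, series in df.sort_values("timestamp").groupby("entity"):
        print(entity, series["target"].tolist())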

Let's see what happens if we remove each of these columns to illustrate their necessity.

Timestamp  Target
1          5
2          6
3          7
1          13
2          27
3          55

Removing the entity column effectively leaves us with two values for the same timestamp. If the data were in this format and we were told that a new entity starts each time the timestamp drops below its previous value, we could reconstruct the initial table with its entity column.

Entity  Target
A       5
A       6
A       7
B       13
B       27
B       55

Removing the timestamp gives us the values each entity takes, but we don't know when they were recorded. Again, if we're told that the rows have been kept in chronological order, we could reconstruct the timestamp column.

Entity  Timestamp
A       1
A       2
A       3
B       1
B       2
B       3

Removing the target column makes this problem impossible to solve. We're left with only the entities that were measured and the time of measurement, but no measurement, which makes the two other values useless.

Is it possible to extract someone's beliefs by reading their writings?

Yes.

If someone has written a blog, for example, it is possible to figure out a variety of things about them simply based on what is written in its articles.

If someone writes about a specific piece of software, we may not be able to infer right away what they think about it, but we know that they've spent enough time to learn a bit about it. If we know all the other alternatives in this category of software, we could infer that the writer believes this software is possibly better than the alternatives, otherwise why would they have picked it?

If the writer writes a lot on a topic, that is also a clue about their beliefs. They probably think that this topic is important, hence why they write about it. Maybe they write about this topic because it is lucrative to them.

What the writer doesn't write about is also informative. If they write mostly about technology, maybe they don't care about politics or sports?

It is possible to extract if-then rules from their writings, which generally express some form of belief: if something, then something else. There are variants where the "then" is left implicit ("if something, something else", or "something else, if something").

A writer may use certain adjectives to describe things as "easy", "simple", "straightforward", "difficult", "impossible", "hard", etc. Those are also useful indicators of the writer's beliefs.

Nowadays with LLMs it's easy to provide a prompt such as "What beliefs are expressed in the following text?" or "Extract the beliefs in the following text". With a bit of scripting we can extract everything available on the blog and feed it to an LLM to let it extract the beliefs for us.
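
A minimal sketch of that scripting step, assuming the OpenAI Python client; the model name and prompt wording are assumptions, and any provider with a chat endpoint would work the same way.

    # Minimal sketch: send one article at a time to an LLM and ask for its beliefs.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_beliefs(article_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model, replace with whatever you use
            messages=[
                {"role": "user",
                 "content": "Extract the beliefs in the following text:\n\n" + article_text},
            ],
        )
        return response.choices[0].message.content

    # Usage: loop over the articles scraped from the blog.
    # for article in articles:
    #     print(extract_beliefs(article))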

I've been given a dataset and I need to assess its quality.

Use Pandas Profiling to quickly generate a document that will provide you with a first overview of the data.
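
A minimal sketch of that step, assuming the data arrives as a CSV file; note that the package has since been renamed from pandas-profiling to ydata-profiling, so adjust the import to whichever version you have installed.

    # Minimal sketch: generate the profiling report and open it in a browser.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv("dataset.csv")  # assumed file name
    profile = ProfileReport(df, title="Dataset quality assessment")
    profile.to_file("report.html")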

Your first step should be to look for warnings and messages at the top of the document. Look for entries about missing values; those will point you to variables that may need attention during the data cleaning and data imputation phases of your machine learning problem. As you are doing an assessment, simply indicate that data is missing in these variables, then see if you can determine why by loading the data in a pandas dataframe and looking at a few examples.

Are there a lot of duplicated rows? Depending on the data you've been given, this may help you identify whether something is wrong with it. If all entries are supposed to be unique because they represent a single (entity, timestamp, target) tuple, then you should ask yourself why that isn't the case. Is it possible that the dataset was created by appending a collection of other documents, leading to duplicate lines? If so, you may have to do some preprocessing to get rid of duplicate rows.
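
A minimal sketch of that check, assuming the same CSV file as above and assuming (entity, timestamp) is the key that should be unique.

    # Minimal sketch: count duplicated rows and rows sharing an assumed unique key.
    import pandas as pd

    df = pd.read_csv("dataset.csv")  # assumed file name

    print(f"{df.duplicated().sum()} fully duplicated rows")
    print(f"{df.duplicated(subset=['entity', 'timestamp']).sum()} rows share an (entity, timestamp) key")

    df = df.drop_duplicates()  # keep the first occurrence of each duplicated row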

Look for variables that are flagged as highly correlated with other variables. High correlation may mean that one variable has exactly (or almost) the same values as another, in which case it provides little additional information to a machine learning model. Keeping only one of two highly correlated variables also avoids the cost of storing both.
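
A minimal sketch of that check outside the profiling report; the 0.95 threshold is an arbitrary example value.

    # Minimal sketch: list pairs of numeric variables with high absolute Pearson correlation.
    import pandas as pd

    df = pd.read_csv("dataset.csv")  # assumed file name
    corr = df.select_dtypes("number").corr().abs()

    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > 0.95:
                print(f"{a} and {b} are highly correlated ({corr.loc[a, b]:.2f})")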

Look at each variable in turn and view its details.

Look at the distribution of values. Are they uniformly distributed, normally distributed, binomially distributed, etc.?

If there are only two possible values for a variable, do they occur approximately equally often, or is one value dominant compared to the other? Were you to try to predict this variable, you would have to deal with class imbalance.

Are the values of the variables sensible to you? Are some variables composed of multiple pieces of information, such as a value and the unit used for the measurement? You would generally prefer composite values to be separated into different variables, as they will be easier to process with machine learning models.

When looking at the distribution of a numerical variable, are there outliers (values that are either a lot smaller or larger than the rest)? It is sometimes important to ask those who provided you with the data if they can explain those outliers. In general you will want to exclude outliers during training as they may skew your model toward them, resulting in less than ideal results for all the other data points.
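
A minimal sketch of an outlier check using the common 1.5 × IQR rule; the column name is a placeholder, and the rule itself is only one possible convention.

    # Minimal sketch: flag values outside 1.5 * IQR of the interquartile range.
    import pandas as pd

    df = pd.read_csv("dataset.csv")          # assumed file name
    column = "target"                        # placeholder column to inspect

    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[column] < q1 - 1.5 * iqr) | (df[column] > q3 + 1.5 * iqr)]
    print(f"{len(outliers)} potential outliers out of {len(df)} rows")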

The quality of a dataset is inversely proportional to the number of operations you need to apply to it to make it a clean dataset. That is to say, if you don't need to do anything to the data provided to you, then it is a good dataset.

What is learning according to machine learning?

It is (for supervised learning) looking at numerous samples, decomposing them into input variables and their associated target variable, and deriving according to an algorithm how to predict the target variable given input variables.

It is the (potentially lossy) compression of the observed samples, where the learning algorithm describes the compression/decompression algorithm. The compressed data is the information necessary for the algorithm to make predictions (decompression).

It is the creation of some "memory" of the observed samples. Whereas an untrained model has no memory of the dataset since it hasn't seen the data, a trained model has some form of memory. A simple model such as sklearn's DummyRegressor will learn and memorize the mean of the target variable. It may not have learned and memorized much, but it has built its internal model of the data.
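
A minimal sketch illustrating that point with scikit-learn; the input values are made up for the example.

    # Minimal sketch: DummyRegressor "learns" only the mean of the target,
    # which is the simplest possible memory a model can build of the data.
    import numpy as np
    from sklearn.dummy import DummyRegressor

    X = np.arange(6).reshape(-1, 1)        # inputs, ignored by this model
    y = np.array([5, 6, 7, 13, 27, 55])

    model = DummyRegressor(strategy="mean").fit(X, y)
    print(model.constant_)                 # the memorized mean of y
    print(model.predict([[100]]))          # every prediction is that mean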

It is to imitate as closely as possible the source of data it is trained on. This means that given input variables, it should produce target values that are as close as possible to those observed during training (learning).